ABSTRACT

Discriminative segmental models, such as segmental conditional random
fields (SCRFs) and segmental structured support vector machines (SSVMs),
have had success in speech recognition via both lattice rescoring and
first-pass decoding. However, such models suffer from slow decoding,
hampering the use of computationally expensive features, such as segment
neural networks or other high-order features. A typical solution is to
use approximate decoding, either by beam pruning in a single pass or by
beam pruning to generate a lattice followed by a second pass. In this
work, we study discriminative segmental models trained with a hinge loss
(i.e., segmental structured SVMs). We show that beam search is not
suitable for learning rescoring models in this approach, though it gives
good approximate decoding performance when the model is already
well-trained. Instead, we consider an approach inspired by structured
prediction cascades, which use max-marginal pruning to generate lattices.
We obtain a high-accuracy phonetic recognition system with several
expensive feature types: a segment neural network, a second-order
language model, and second-order phone boundary features.

Index Terms — segmental conditional random field, 
structured prediction cascades, phone recognition, segment 
neural network, beam search 

1. INTRODUCTION 

Segmental models have been considered for speech recognition as an
alternative to frame-based models such as hidden Markov models (HMMs),
in order to address the shortcomings of the frame-level Markov
assumption and introduce expressive segment-level features. Segmental
models include segmental conditional random fields (SCRFs) [1], or
semi-Markov conditional random fields [2]; segmental structured support
vector machines (SSVMs) [?]; and generative segmental models [?].
Previous work comparing segmental model training algorithms has shown
some benefits of discriminative segmental models trained with hinge
loss (SSVM-type learning) [6], and we consider this type of model here.

Discriminative segmental models have allowed the exploration of complex
features, both at the word level [1] and at the phone level [?, 9, ?].
These powerful segmental features are a double-edged sword: on the one
hand, the model becomes more expressive; on the other, it is
computationally challenging to decode with and train such models. For
this reason, SCRFs [10] and SSVMs [?] were initially applied to speech
recognition in a multi-pass approach, where the segmental model
considers only a subset of the hypothesis space contained in lattices
generated by HMMs. Much effort has been devoted to removing the
dependency on HMMs and instead developing first-pass segmental models
[?, 9, ?, 12]. However, working with the entire hypothesis space imposes
an even larger burden on inference, especially when the features are
computationally intensive or of high order.

If we wish to consider the entire search space in decoding, we can only
afford features of low order or of specific types as in [?]. An
alternative approach to the problem is to use approximate decoding.
There are two widely used approximate decoding algorithms: beam search
and multi-pass decoding. In the intuitive and popular beam search, the
idea is to prune as we search along the graph representing the search
space. It has been used for decoding in almost all HMM systems, and for
generating lattices as well. Though popular, it offers no guarantees
about its approximation. In the category of multi-pass decoding, lattice
and n-best list rescoring [?] are commonly used alternatives.

We focus on a particular type of multi-pass approach based on structured
prediction cascades [14], which we term discriminative segmental
cascades. A cascade is a general approach for decoding and training
complex structured models, using a multi-pass sequence of models with
increasing order of features, while pruning the hypothesis space by a
multiplicative factor to counteract the growth in feature computation.
In this approach, the hypothesis space in each pass is pruned with
max-marginals, which offers the guarantee that all paths with scores
higher than the pruning threshold are kept.

Applying the discriminative segmental cascade approach to
speaker-independent phonetic recognition on the TIMIT data set, we
obtain a first-pass phone error rate of 21.4% with a unigram language
model, and a two-stage cascade error rate of 19.9%, which includes a
bigram language model, a segment neural network classifier, and
second-order phone boundary features. This is to our knowledge the best
result to date with a segmental model. In the following sections we
define the discriminative segmental models we consider, describe how we
represent a cascade of hypothesis spaces with a finite-state
composition-like operation, present discriminative segmental cascades
for decoding and training with max-marginal pruning, and discuss our
experiments.

2. DISCRIMINATIVE SEGMENTAL MODELS 

A linear segmental model for input space $\mathcal{X}$ and hypothesis
space $\mathcal{Y}$ is defined formally as a pair $(\theta, \phi)$,
where $\theta \in \mathbb{R}^d$ is the parameter vector and
$\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ is the feature
vector. For an input $x \in \mathcal{X}$, each hypothesis
$y \in \mathcal{Y}$ is associated with a score $\theta^\top \phi(x, y)$,
and the goal of decoding is to find the hypothesis that maximizes the
score,

$$\operatorname*{argmax}_{y \in \mathcal{Y}} \theta^\top \phi(x, y). \tag{1}$$

For speech recognition, we formally define the hypothesis space
$\mathcal{Y}$ in terms of finite-state transducers (FSTs). Let $\Sigma$
be the label set (e.g., the phone set in phone recognition), and
$\bar{\Sigma} = \Sigma \cup \{\epsilon\}$, where $\epsilon$ is the empty
label. Define a decoding graph as a standard FST
$G = (V, E, I, F, w, i, o)$, where $V$ is the set of vertices,
$E \subseteq V \times V$ is the set of edges, $I \subseteq V$ is the set
of initial vertices, $F \subseteq V$ is the set of final vertices,
$w : E \to \mathbb{R}$ is a function that associates a weight to an
edge, $i : E \to \bar{\Sigma}$ is a function that associates an input
label to an edge, and $o : E \to \bar{\Sigma}$ is a function that
associates an output label to an edge. In addition to the standard
definition of FSTs, we equip $G$ with a function $t : V \to \mathbb{R}$
that maps a vertex to a time stamp. For any edge $(u, v) \in E$, let
$\mathrm{tail}((u, v)) = u$ and $\mathrm{head}((u, v)) = v$. For
convenience, we will use subscripts to denote components of a particular
FST, e.g., $E_G$ is the edge set of $G$.

For an input utterance, let $x$ be the sequence of acoustic feature
vectors. We construct a decoding graph $G$ from $x$, then define our
hypothesis space $\mathcal{Y} \subseteq 2^E$ to be the subset of paths
that start at an initial vertex in $I$ and end at a final vertex in $F$.
A path $y \in \mathcal{Y}$ of length $m$ is a sequence of unique edges
$\{e_1, \ldots, e_m\}$, satisfying
$\mathrm{head}(e_i) = \mathrm{tail}(e_{i+1})$ for $1 \le i < m$. Given a
model $(\theta, \phi)$, for each edge $e \in E$, the weight $w(e)$ is
defined as $\theta^\top \phi(x, e)$. For convenience, for a path
$y \in \mathcal{Y}$, we overload $\phi$ and $w$ and define
$\phi(x, y) = \sum_{e \in y} \phi(x, e)$ and
$w(y) = \theta^\top \phi(x, y) = \sum_{e \in y} w(e)$, where we treat a
path $y$ as a set of (unique) edges $e$.

If the decoding graph is the full hypothesis space with all possible
segmentations and all possible labels, for example the graph on the left
in Figure 1, then the model is a first-pass segmental model. Otherwise,
it is a lattice rescoring model. By the above definitions, inference
(decoding) in the model (1) can be solved with a standard shortest-path
algorithm.

The model parameters $\theta$ can be learned by minimizing the sum of
loss functions on samples $(x, y)$ in a training set. In general, the
model can be trained with different losses. The model is an SCRF if we
train it with log loss $-\log p(y \mid x)$ where
$p(y \mid x) \propto \exp(\theta^\top \phi(x, y))$. It is a segmental
structured SVM if we use the structured hinge loss:

$$\ell_{\mathrm{hinge}}(\theta) = \max_{y' \in \mathcal{Y}} \left[ \mathrm{cost}(y, y') - \theta^\top \phi(x, y) + \theta^\top \phi(x, y') \right], \tag{2}$$

where $\mathrm{cost} : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$
measures the badness of a hypothesis path $y'$ compared with the ground
truth $y$.

The loss can be optimized with first-order methods, such as stochastic
gradient descent (SGD). The gradient (or subgradient, in this case)
computation typically involves a forward-backward-like algorithm. For
example, the subgradient of the hinge loss is

$$\nabla_\theta \ell_{\mathrm{hinge}}(\theta) = -\phi(x, y) + \phi(x, \hat{y}), \tag{3}$$

where computing the cost-augmented path

$$\hat{y} = \operatorname*{argmax}_{y' \in \mathcal{Y}} \mathrm{cost}(y, y') + \theta^\top \phi(x, y'), \tag{4}$$

requires a forward pass over the graph. Compared to computing the
gradient of other losses, which requires more forward passes and
backward passes, hinge loss has computational advantages, and has been
shown to perform well [?], so we will use hinge loss for the rest of the
paper.
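
As a rough illustration of decoding and of the hinge-loss subgradient in this section, the following Python sketch finds the best path through a small decoding graph with one forward pass and then computes the subgradient from the cost-augmented path. The toy feature map `edge_features`, the per-edge 0/1 cost, and the integer (time-ordered) vertex numbering are assumptions made for the example; they are not the features or the overlap cost used in the experiments.

```python
from collections import defaultdict

def best_path(edges, weight, start, final):
    """Highest-scoring start-to-final path in a DAG.

    Edges are (tail, head, label) tuples; vertices are integers such that
    every edge goes from a smaller to a larger vertex (time order).
    """
    best, back = {start: 0.0}, {}
    for e in sorted(edges, key=lambda e: e[0]):  # tails in topological order
        u, v, _ = e
        if u in best and (v not in best or best[u] + weight(e) > best[v]):
            best[v] = best[u] + weight(e)
            back[v] = e
    path, v = [], final
    while v != start:                # follow back-pointers from the final vertex
        path.append(back[v])
        v = back[v][0]
    return path[::-1], best[final]

def edge_features(e):
    """Hypothetical sparse features: a label indicator and a length indicator."""
    u, v, label = e
    return {("label", label): 1.0, ("len", v - u): 1.0}

def score(theta, e):
    """Edge weight w(e) = theta^T phi(x, e) for sparse dict features."""
    return sum(theta.get(k, 0.0) * val for k, val in edge_features(e).items())

def hinge_subgradient(edges, theta, gold, start, final):
    """Return the cost-augmented path and -phi(x, y) + phi(x, y_hat)."""
    gold_set = set(gold)
    # Assumed decomposable cost: 1 for every edge that is not on the gold path.
    augmented = lambda e: score(theta, e) + (0.0 if e in gold_set else 1.0)
    y_hat, _ = best_path(edges, augmented, start, final)
    grad = defaultdict(float)
    for e in y_hat:
        for k, val in edge_features(e).items():
            grad[k] += val
    for e in gold:
        for k, val in edge_features(e).items():
            grad[k] -= val
    return y_hat, dict(grad)
```

Because the assumed cost decomposes over edges, cost-augmented decoding reuses the same forward pass as ordinary decoding, only with modified edge weights.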

3. HIGH-ORDER FEATURES 
AND STRUCTURED COMPOSITION 

The order of a feature is defined as the number of labels on 
which it depends. A feature is said to be a first-order feature 
if it depends on a single label, a second-order feature if it 
depends on a pair of labels, and so on. Features with no label 
dependency are called zeroth-order features. 

High-order features in sequence prediction can be extended from
low-order ones by increasing the number of labels considered. Formally,
for any label set $\Sigma$ and any feature vector
$\phi \in \mathbb{R}^d$, the feature vector lexicalized with a label
$s \in \Sigma$ is defined as $\phi \otimes \mathbb{1}_s$, where
$\mathbb{1}_s$ is a one-hot vector of length $|\Sigma|$ for the label
$s$ and
$\otimes : \mathbb{R}^{m \times n} \times \mathbb{R}^{p \times q} \to \mathbb{R}^{mp \times nq}$
is the outer product. With a slight abuse of notation, we let
$\phi \otimes s = \phi \otimes \mathbb{1}_s$. The resulting vector is of
length $|\Sigma| d$. Similarly, we can lexicalize a feature vector with
pairs of labels,
$\phi \otimes s_1 \otimes s_2 = \phi \otimes \mathbb{1}_{s_1} \otimes \mathbb{1}_{s_2}$,
giving a vector of length $|\Sigma|^2 d$.
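
The lexicalization just described can be sketched in a few lines of plain Python. This is only an illustration under assumed names (`one_hot`, `kron`, `lexicalize`) and a dense list representation; a practical system would more likely store only the nonzero block.

```python
def one_hot(label, labels):
    """One-hot vector of length |labels| for the given label."""
    return [1.0 if s == label else 0.0 for s in labels]

def kron(a, b):
    """Kronecker (outer) product of two vectors, flattened to one vector."""
    return [x * y for x in a for y in b]

def lexicalize(phi, *label_seq, labels):
    """Lexicalize phi with one label (first order) or several (higher order)."""
    out = list(phi)
    for s in label_seq:
        out = kron(out, one_hot(s, labels))
    return out
```

For a feature vector of length d and a label set of size |Σ|, `lexicalize(phi, s, labels=labels)` has length |Σ|d and `lexicalize(phi, s1, s2, labels=labels)` has length |Σ|²d, matching the counts in the text.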

For example, a common type of zeroth-order segmental feature is of the
form $\psi(x, t_1, t_2)$ where $x$ is the sequence of acoustic feature
vectors, $t_1$ is the start time of the segment, and $t_2$ is the end
time of the segment. To make it discriminative in a decoding graph $H$,
we can compute the first-order feature $\phi_H(x, e)$ for any edge $e$
by first computing
$\psi(x, t(\mathrm{tail}(e)), t(\mathrm{head}(e)))$ and then
lexicalizing it with the label $o_H(e)$.




Fig. 1. From left to right: An example of the full hypothesis space
$H_1$ with four frames (five vertices) and three unique labels
{a, b, c} (three edges between every pair of vertices) with segment
length up to three frames (actual labels omitted for clarity); $H_2$, a
pruned $H_1$; a graph structure corresponding to a bigram language model
$L_2$ over three labels; and $H_2$ σ-composed with $L_2$, where
$s_1\_s_2$ denotes the bigram $s_1 s_2$.



To have a unified way of extending the order of features, we define the
concept of FST structured composition, or σ-composition for short, as
follows. For any two FSTs $A$ and $B$, the σ-composed FST is defined as

$$G = A \circ_\sigma B \tag{5}$$

where

$$V_G = V_A \times V_B \tag{6}$$

$$E_G = \{(e_1, e_2) \in E_A \times E_B : o_A(e_1) = i_B(e_2)\} \tag{7}$$

and

$$i_G((e_1, e_2)) = i_A(e_1) \tag{8}$$

$$o_G((e_1, e_2)) = o_B(e_2) \tag{9}$$

$$\mathrm{tail}_G((e_1, e_2)) = (\mathrm{tail}_A(e_1), \mathrm{tail}_B(e_2)) \tag{10}$$

$$\mathrm{head}_G((e_1, e_2)) = (\mathrm{head}_A(e_1), \mathrm{head}_B(e_2)) \tag{11}$$

where $(\cdot, \cdot)$ denotes a tuple. Unlike in classical composition,
we only constrain the structure of $G$ and are free to define $w_G$
differently. In particular, we let

$$w_G((e_1, e_2)) = \theta_G^\top \phi_G(x, (e_1, e_2)), \tag{12}$$

and $\phi_G$ is free to use $\phi_A$ and $\phi_B$ but is not constrained
to do so. In other words, the weight function $w_G$ can extract richer
features than $w_A$ and $w_B$.

With structured composition, we can easily convert low-order features to
high-order ones. Continuing the above example, we can σ-compose the
decoding graph $H$ with a bigram language model (LM) $L$ in its FST form
[15] with a slight modification. We require the output labels of the LM
FST to include the history labels alongside the current label. For
example, the output labels of a bigram LM are of the form
$s_1 s_2 \in \bar{\Sigma} \times \Sigma$, where $s_1$ is the history
label (possibly $\epsilon$) and $s_2$ is the current label. Let
$G = H \circ_\sigma L$. We can define
$t_G((v_1, v_2)) = t_H(v_1)$ for each vertex $(v_1, v_2) \in V_G$. For
an edge $e \in E_G$, we can compute first-order features
$\psi \otimes s_1$, and second-order features
$\psi \otimes s_1 \otimes s_2$ for $s_1 s_2 = o_G(e)$ and
$s_1 \neq \epsilon$, where
$\psi = \psi(x, t_G(\mathrm{tail}_G(e)), t_G(\mathrm{head}_G(e)))$. If
$s_1 = \epsilon$, everything falls back to the previous example. In
general, by σ-composing with high-order n-gram LMs, we can compute
high-order features by lexicalizing low-order ones.
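
To make the structural part of σ-composition concrete, here is a minimal Python sketch that pairs edges of two FSTs whenever the output label of the first matches the input label of the second, and builds the input label, output label, tail, and head of each composed edge as in the definition. The `Edge` tuple, the `"<eps>"` history symbol, and the brute-force double loop are assumptions for illustration; the sketch does not treat ε transitions specially, does not remove unreachable edges, and leaves the weight function undefined, since the text leaves it free.

```python
from collections import namedtuple

# Hypothetical edge record: tail vertex, head vertex, input label, output label.
Edge = namedtuple("Edge", "tail head ilabel olabel")

def sigma_compose(edges_a, edges_b):
    """Structure of A composed with B: keep pairs with o_A(e1) == i_B(e2)."""
    composed = []
    for e1 in edges_a:
        for e2 in edges_b:
            if e1.olabel == e2.ilabel:
                composed.append(Edge(
                    tail=(e1.tail, e2.tail),    # tuple of tails
                    head=(e1.head, e2.head),    # tuple of heads
                    ilabel=e1.ilabel,           # input label taken from A
                    olabel=e2.olabel,           # output label taken from B
                ))
    return composed
```

For instance, with a two-frame graph `H = [Edge(t, t + 1, s, s) for t in range(2) for s in "ab"]` and a toy bigram structure `L2 = [Edge(h, s, s, (h, s)) for h in ("<eps>", "a", "b") for s in "ab"]`, every composed edge carries a (history, current) label pair on its output side, which is what allows second-order features to be attached to it.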


4. DISCRIMINATIVE SEGMENTAL CASCADES 


Our approach, which we term a discriminative segmental cascade (DSC), is
an instance of multi-pass decoding, consisting of levels with increasing
complexity of features and decreasing size of search space. We start
with the full search space and a "simple" first-level discriminative
segmental model using inexpensive features, and use the first-level
model to prune the search space. We then apply a model using more
expensive features, and optionally repeat the process for as many levels
as desired. Rather than the typical beam pruning, we prune with
max-marginals [16, 14], which have certain useful properties and turn
out to be important for achieving good performance with our models. A
max-marginal of an edge $e$ in $G$ is defined as

$$\gamma(e) = \max_{y \in \mathcal{Y} : e \in y} \theta^\top \phi(x, y). \tag{13}$$

In words, it is the highest score of a path that passes through the edge
$e$. We prune the edge if its max-marginal is lower than a threshold,
and keep it otherwise. In order to prune a multiplicative factor of
edges at each level of the cascade, Weiss et al. [14] propose to use the
threshold

$$\tau_\lambda = (1 - \lambda) \frac{1}{|E_G|} \sum_{e \in E_G} \gamma(e) + \lambda \max_{y \in \mathcal{Y}} \theta^\top \phi(x, y), \tag{14}$$

which interpolates between the mean of the max-marginals and the
maximum. If $\lambda$ is set to 1, we only keep the best path.

Lattice generation by max-marginal pruning guarantees that there is
always at least one path left after pruning and that any $y$ satisfying
$w(y) > \tau_\lambda$ is kept, because for every $e \in y$,
$\gamma(e) \ge w(y) > \tau_\lambda$. In particular, if the ground truth
has a score higher than the threshold, it will still be in the search
space for the next level of the cascade.

Computing max-marginals in a specific level of the cascade requires a
forward pass and a backward pass through the graph. Pruning with
max-marginals thus takes roughly twice as long as searching for the best
path alone.
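
The forward and backward passes and the interpolated threshold can be sketched as follows. This is a minimal Python illustration, assuming edges are `(tail, head, label)` tuples over integer vertices in time order, edge weights are given as a dictionary, and every edge lies on at least one complete path (as in a full segmental graph); none of these representation choices come from the paper.

```python
def max_marginals(edges, weight, start, final):
    """Max-marginal of each edge: best score of a complete path through it."""
    neg_inf = float("-inf")
    fwd = {start: 0.0}                           # best score from start to vertex
    for u, v, s in sorted(edges, key=lambda e: e[0]):
        if u in fwd:
            fwd[v] = max(fwd.get(v, neg_inf), fwd[u] + weight[(u, v, s)])
    bwd = {final: 0.0}                           # best score from vertex to final
    for u, v, s in sorted(edges, key=lambda e: -e[1]):
        if v in bwd:
            bwd[u] = max(bwd.get(u, neg_inf), weight[(u, v, s)] + bwd[v])
    return {e: fwd.get(e[0], neg_inf) + weight[e] + bwd.get(e[1], neg_inf)
            for e in edges}

def prune(edges, weight, start, final, lam):
    """Keep edges whose max-marginal reaches the interpolated threshold."""
    gamma = max_marginals(edges, weight, start, final)
    best = max(gamma.values())                   # equals the best path score
    mean = sum(gamma.values()) / len(gamma)
    tau = (1.0 - lam) * mean + lam * best
    return [e for e in edges if gamma[e] >= tau], tau
```

With `lam = 1` only edges on a best path survive, while smaller values of `lam` move the threshold toward the mean max-marginal and keep a denser lattice.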

Learning the cascade of models is also done level by level. We start
with the entire hypothesis space $H_1$ limited only by a maximum segment
length. A first set of computationally inexpensive features up to first
order is used for learning. Let the first set of weights learned be
$\theta_1$. We can use $\theta_1$ for first-pass decoding if it is good
enough, or we can choose to generate the next level of the cascade and
use more computationally expensive features, such as higher-order ones.
Moving to the next level of the cascade, we compute max-marginals with
$\theta_1$ and prune $H_1$ with a threshold, resulting in a lattice
$H_2$. If we wish to increase the order of features, we σ-compose $H_2$
with a bigram LM $L_2$. A second set of features up to second order can
then be used for learning. Suppose the second set of weights is
$\theta_2$. Again, we have the choice either to use $\theta_2$ for
decoding or to prune and repeat the process with more computationally
expensive features.

5. EXPERIMENTS

We experiment with segmental models in the context of phonetic
recognition on the TIMIT corpus [17]. We follow the standard TIMIT
protocol for training and testing. We use 192 randomly selected
utterances from the complete test set other than the core test set as
our development set, and will refer to the core test set simply as the
test set. The phone set is collapsed from 61 labels to 48 before
training. In addition to the 48 phones, we also keep the glottal stop
/q/, sentence start, and sentence end so that every frame in the
training set has a label. A summary of prior first-pass decoding results
with segmental models, along with our results and one from a standard
speaker-independent HMM-DNN, is shown in Table 1.

Table 1. A summary of results in terms of phonetic error rate (%) on the
TIMIT test set, for prior first-pass segmental models, a
speaker-independent HMM-DNN system given by a standard Kaldi recipe
[18], and our models.

                                          dev PER (%)   test PER (%)
HMM-DNN                                        -            21.4
first first-pass SCRF [8]                      -            33.1
Boundary-factored SCRF [9]                     -            26.5
Deep segmental NN [11]                         -            21.87
our first-pass model (H_1)                   22.15          21.73
DSC 2nd level with bigram LM                 19.80            -
  + 2nd-order boundary features              19.22            -
  + 1st-order segment NN                     18.86            -
  + 1st-order bi-phone NN bottleneck         18.77          19.93

5.1. First-pass segmental model

First we demonstrate the effectiveness of our first-pass decoder. The
first-pass search graph, denoted $H_1$, contains all possible labels and
all possible segmentations up to 30 frames per segment. Like some prior
segmental phonetic recognition models [?], many of the features in our
first-pass decoder are based on averaging and sampling the outputs of a
neural network phonetic frame classifier, specifically a convolutional
neural network (CNN) [?], which we describe next.

5.1.1. CNN frame classifier 

The input to the network is a window of 15 frames of log-mel filter
outputs. The network has five convolutional layers, with 64-256 filters
of size 5×5 for the input and 3×3 for others, each of which is followed
by a rectified linear unit (ReLU) [20] activation, with max pooling
layers after the first and the third ReLU layers. The output of the
final ReLU layer is concatenated with a window of 15 frames of MFCCs
centered on the current frame, and the resulting vector is passed
through three fully connected ReLU layers with 4096 units each. The
network is trained with SGD for 35 epochs with a batch size of 100
frames. Fully connected layers and the concatenation layer are trained
with dropout at a 20% and 50% rate, respectively. This classifier was
tuned on the development set and achieves a 22.1% frame error rate
(after collapsing to 39 phone labels) on the test set. We will use
$\mathrm{CNN}(x, t)$ to denote the log of the final softmax layer,
corresponding to the predicted log probabilities of the phones, given as
input $[x_{t-7}; \cdots; x_{t+7}]$.

5.1.2. First-order features 

Below we list the features for each edge $(u, v)$. We will use
$L = t(v) - t(u)$ for short.

Average of CNN log probabilities The log of the CNN output layer is
averaged over all frames in the segment:

$$\frac{1}{L} \sum_{i=0}^{L-1} \mathrm{CNN}(x, t(u) + i) \tag{15}$$

Samples of CNN log probabilities The log of the CNN output layer is
sampled from the middle frames of three equally split sub-segments,
i.e.,

$$\mathrm{CNN}\left(x, t(u) + \frac{[k + (k + 1)] L}{3 \cdot 2}\right) \tag{16}$$

for $k = 0, 1, 2$.

Boundary features The log probabilities $i$ frames before the left
boundary $\mathrm{CNN}(x, t(u) - i)$ and $i$ frames after the right
boundary $\mathrm{CNN}(x, t(v) + i)$ are used as features. We use the
concatenation of the boundary features for $i = 1, 2, 3$.

Length indicator $\mathbb{1}_{L = \ell}$ for $\ell = 0, 1, \ldots, 30$.

Bias A constant 1.

We lexicalize all of the above features to first order, and include a
zeroth-order bias feature. We minimize hinge loss with the overlap cost
function introduced in [6] with AdaGrad for up to 70 epochs with step
sizes tuned in {0.01, 0.1, 1}. No explicit regularizer is used; instead
we choose the step size and iteration that perform best on the
development set (so-called early stopping). As shown in Table 1, our
first-pass segmental model outperforms all previous segmental model
TIMIT results of which we are aware.

Fig. 2. Beam search on $H_1$ with different beam widths. Left: Hit rate
on the development set. Right: PER on the development set. The dashed
line is the PER of the exact search.
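
The averaging, sampling, and length-indicator features of Section 5.1.2 can be sketched in Python as below. Here `log_probs` stands for a per-frame list of CNN log-probability vectors, and the function names are hypothetical. Rounding the sample position in Eq. (16) down to an integer frame index is an assumption of this sketch; the text does not say how fractional positions are handled.

```python
def average_feature(log_probs, t_u, t_v):
    """Mean of the frame log-probability vectors over frames t_u .. t_v - 1."""
    seg_len = t_v - t_u
    dim = len(log_probs[0])
    return [sum(log_probs[t_u + i][j] for i in range(seg_len)) / seg_len
            for j in range(dim)]

def sample_frames(t_u, t_v):
    """Middle frame of each of three equal sub-segments (k = 0, 1, 2).

    The offset is [k + (k + 1)] * L / (3 * 2); integer floor is assumed.
    """
    seg_len = t_v - t_u
    return [t_u + ((2 * k + 1) * seg_len) // 6 for k in range(3)]

def length_indicator(seg_len, max_len=30):
    """One-hot indicator of the segment length for lengths 0 .. max_len."""
    return [1.0 if seg_len == l else 0.0 for l in range(max_len + 1)]
```

Each of these zeroth-order vectors would then be lexicalized with the edge label to obtain the first-order features used by the first-pass model.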

5.2. Higher-order features and segmental cascades 

We next explore multi-pass decoding with beam search and with
discriminative segmental cascades. In the second pass we include
features of order two and a bigram LM $L_2$. Backoff is approximated
with $\epsilon$ transitions in the bigram LM. Let
$G = H \circ_\sigma L_2$, where $H$ can be $H_1$ or $H_2$, the pruned
$H_1$. We consider the following additional features on edges
$e \in E_G$.

Bigram LM score The bigram log probability
$\log p_{\mathrm{LM}}(s_2 \mid s_1)$, where $s_1 s_2 = o_G(e)$. We do
not lexicalize this feature because it is naturally second-order.

5.2.1. Beam search 

Before experimenting with the second-order features, we compare beam
search and exact search on the best model for $H_1$ to give a sense of
the approximation quality of beam search. We measure the quality of
approximation via the "hit rate", i.e., how often the exact best path is
found. Results are shown in Figure 2. As expected, the hit rate
decreases as the beam width decreases. However, the PER does not degrade
significantly, which demonstrates that beam search is a good approximate
decoding algorithm when the model is well-trained.

Judging from the decoding results, we use beam search with beam widths
{10, 20, 30} for learning. Since the runtime of beam search is
controlled by the beam width when the decoding graph is large, we can
search directly on $H_1 \circ_\sigma L_2$. The composition is done on
the fly to avoid enumerating all edges in $H_1 \circ_\sigma L_2$. We
compare learning on both $H_1$ and $H_1 \circ_\sigma L_2$. For $H_1$ we
use the same features as the first-pass segmental model, while for
$H_1 \circ_\sigma L_2$ we add the bigram LM score and second-order
boundary features. For consistency, we use the same beam width for
decoding. Hinge loss is minimized with AdaGrad with step sizes tuned in
{0.01, 0.1, 1}. Results are shown in Figure 3 for the step size that
achieves the lowest development set PER. When we train the segmental
model on $H_1$ (top of Figure 3), learning with beam search is
successful when the beam width is large enough, while for
$H_1 \circ_\sigma L_2$ (bottom of Figure 3), learning completely fails.

Fig. 3. Beam search for learning with different beam widths (beam=10,
beam=20, beam=30) compared with exact search. Top: Learning on $H_1$.
Bottom: Learning on $H_1 \circ_\sigma L_2$. The dashed line is the
learning curve of the second-level cascade $H_2 \circ_\sigma L_2$.

5.2.2. Discriminative segmental cascades (DSC) 

We next consider the proposed discriminative segmental cascades (DSC)
for utilizing the bigram LM and second-order features. We first prune
$H_1$ with max-marginal pruning using our first-pass segmental model
with weights $\theta_1$, resulting in $H_2$, and σ-compose $H_2$ with
$L_2$. Recall that the larger the pruning parameter $\lambda$, the
sparser the lattice. We measure the density of the lattice by the number
of edges in $H_2$ divided by the number of ground-truth (gold) edges.
The quality of the lattices $H_2$ produced with different values of
$\lambda$ is shown in Figure 4 (left). For the DSC second level, we
define an additional feature:

Lattice score Instead of re-learning all of the weights for the features
in the first-pass model, we combine them into an additional feature from
the first level of the cascade, $\theta_1^\top \phi_{H_1}(x, e_1)$,
which is never lexicalized, where $e_1 \in E_H$ is such that
$(e_1, e_2) \in E_G$.

To compare with beam search, we use the lattice score, the bigram LM
score, second-order boundary features, first-order length indicators,
and first-order bias as our features for the second level of the
cascade. Hinge loss is minimized with AdaGrad for up to 20 epochs with
step sizes optimized in {0.01, 0.1}. Again, no explicit regularizer is
used except early stopping on the development set. Learning results on
different lattices are shown in Figure 4 (right). We see that learning
with the DSC is clearly better than with beam search.
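
For readers unfamiliar with the optimizer, a minimal Python sketch of a sparse AdaGrad update and of early stopping by development-set PER is given below. The dictionary representation of the weights, the `eps` stabilizer, and the helper names are assumptions for illustration; the paper specifies only that AdaGrad is used with tuned step sizes and that the best iteration on the development set is kept.

```python
import math

def adagrad_step(theta, accum, grad, step_size, eps=1e-8):
    """One AdaGrad update on sparse (dict) parameters, done in place.

    accum holds the running sum of squared subgradients per coordinate.
    """
    for k, g in grad.items():
        accum[k] = accum.get(k, 0.0) + g * g
        theta[k] = theta.get(k, 0.0) - step_size * g / (math.sqrt(accum[k]) + eps)
    return theta, accum

def best_epoch(dev_pers):
    """Early stopping: index of the epoch with the lowest development PER."""
    return min(range(len(dev_pers)), key=lambda i: dev_pers[i])
```

The `grad` argument here would be the hinge-loss subgradient for one utterance, computed on the pruned and σ-composed lattice at the second level of the cascade.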

5.2.3. Other expensive features 

To add more context information, we use the same CNN architecture and
training setup to learn a bi-phone frame classifier, but with an added
256-unit bottleneck linear layer before the softmax [?]. Each frame is
labeled with its segment label and one additional label from a
neighboring segment. If the current frame is in the first half of the
segment, the additional label is the previous phone; if it is in the
second half, then the additional label is the next phone. The learned
bottleneck layer outputs are used to define features (although they do
not correspond to log probabilities) with averaging and sampling as for
the uni-phone case. We refer to the resulting features as bi-phone NN
bottleneck features.

Fig. 4. Quality of $H_2$ for values of $\lambda$ in
{0.8, 0.7, 0.6, 0.5}. Left: Oracle error rates for different lattice
densities. Right: Corresponding second-pass development set PERs. The
horizontal axis in both panels is lattice density (edges per gold edge).

Finally, we also use the same type of CNN to train a segment classifier.
Here the features at the input layer are the log-mel filter outputs from
a 15-frame window around the segment's central frame. The network
architecture is the same as our frame classifier, but instead of
concatenation with 15-frame MFCCs, we concatenate with a segmental
feature vector consisting of the average MFCCs of three sub-segments in
the ratio of 3-4-3, plus two four-frame averages at both boundaries and
length indicators for length 0 to 20 (similar to the segmental feature
vectors of [22, 23]). This CNN is trained on the ground-truth segments
in the training set. We then build an ensemble of such networks with
different random seeds and a majority vote. This ensemble classifier has
a 15.0% classification error on the test set, which is to our knowledge
the best result to date on the task of TIMIT phone segment
classification (see Table 2).

Table 2. TIMIT segment classification error rates (ER).

                                              test ER (%)
Gaussian mixture model (GMM) [23]                26.3
SVM [?]                                          22.4
Hierarchical GMM [22]                            21.0
Discriminative hierarchical GMM [24]             16.8
SVM with deep scattering spectrum [25]           15.9
our CNN ensemble                                 15.0

It is, however, still too time-consuming to compute the segment network
outputs for every edge in the lattice. We instead compress the
best-performing (single) CNN into a shallow network with one hidden
layer of 512 ReLUs by training it to predict the log probability outputs
of the deep network, as proposed by [26, 27]. We then use the log
probability outputs of the shallow network and lexicalize them to first
order. We refer to the result as segment NN features.

Results with these additional features are shown in Table 1. Adding the
second-order features, bigram LM, and the above NN features gives a 1.8%
absolute improvement over our best first-pass system, demonstrating the
value of including such powerful but expensive features.

6. DISCUSSION 

We have presented discriminative segmental cascades (DSC), an approach
for training and decoding with segmental models that allows us to
incorporate high-order and complex features in a coarse-to-fine
approach, and have applied them to the task of phone recognition. The
DSC approach uses max-marginal pruning, which outperforms beam search
for learning the second-pass model. Starting from a first-pass
large-margin model that outperforms previous segmental model results and
is competitive with HMM-DNNs, the DSC second pass improves the phone
error rate by another 1.8% absolute.

Further analysis may be needed to understand precisely why learning with
beam search is not successful in the context of our models. One issue is
that σ-composing $H_1$ and $L_2$ introduces many dead ends (paths that
do not lead to final vertices) in the graph because we have to do the
composition on the fly. Minimizing $H_1 \circ_\sigma L_2$ might help,
but we would need to touch the edges of $H_1 \circ_\sigma L_2$ at least
once, which is itself expensive. Second, even if we reach the final
vertices, the cost-augmented path might still have a lower cost+score
than the ground-truth path, which leads to no gradient update. This
issue has been studied recently, and one possible solution is "premature
updates" [28], but these are intended for the perceptron loss. Third,
the edge weights in our models are not strictly negative. Beam search
would tend to go depth-first when encountering edges with positive
weights. On the other hand, if the edge weights are negative, beam
search would tend to go breadth-first, which may explain why greedy
search like beam search may cause problems for segmental models but
works for HMMs.

Additional future work includes considering even more expressive
features, higher-order features and additional cascade levels. There is
also much room for exploration with segment neural network classifiers.
One concern with our segment classifiers is that they are trained only
with ground-truth segments, so it is unclear how they behave when the
input is an incorrect hypothesized segment. Alternatives include
training on all hypothesized segments and allowing the network to learn
to classify non-phones, similarly to the anti-phone and near-miss
modeling of [?].